TaxE: a Testbed for Hierarchical Document Classifiers

Authors

  • Paolo Avesani
  • Christian Girardi
  • Nicola Polettini
  • Diego Sona
Abstract

In the last decade, interest in the hierarchical organization of documents has increased. New challenges have arisen, such as hierarchical document classification, both unsupervised and supervised. A survey of the most recent literature on these topics shows that the published works do not share a common dataset for their experiments. Moreover, the papers do not provide enough detail to reproduce the same datasets starting from the same information sources. The drawback is twofold: on the one hand, time is wasted preprocessing suitable datasets; on the other hand, there is no common testbed for comparing alternative solutions. In this paper we propose a dataset extracted from the Google and LookSmart web directories to support experimentation in the field of hierarchical document classification. For this task we aim to provide a kind of reference corpus, analogous to the role that Reuters plays in the scientific community. The paper illustrates the process performed to generate a well-defined dataset. This dataset is freely distributed over the web.

Similar resources

Learning Document Image Features With SqueezeNet Convolutional Neural Network

The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered to be the current state-of-the-art model for this task. However, there are two major drawbacks for these classifiers: the huge computational power demand for...

A Document Weighted Approach for Gender and Age Prediction Based on Term Weight Measure

Author profiling is a text classification technique used to predict the profiles of unknown texts by analyzing their writing styles. Author profiles are characteristics of the authors such as gender, age, native language, country, and educational background. The existing approaches for author profiling suffer from problems like high dimensionality of features and fail to capture th...

Performance measurement framework for hierarchical text classification

Hierarchical text classification or simply hierarchical classification refers to assigning a document to one or more suitable categories from a hierarchical category space. In our literature survey, we have found that the existing hierarchical classification experiments used a variety of measures to evaluate performance. These performance measures often assume independence between categories an...

Applications of Machine Learning to Information Access

The recent explosion of on-line information has given rise to a number of query-based search engines (e.g., Alta Vista) and manually constructed topic hierarchies (e.g., Yahoo!). But with the current rate of growth in the amount of available information, query results grow incomprehensibly large and manual classification in topic hierarchies creates an immense information bottleneck. Therefore,...

Some Issues in the Automatic Classification of U.S. Patents

The classification of U.S. patents poses some special problems due to the enormous size of the corpus, the size and complex hierarchical structure of the classification system, and the size and structure of patent documents. The representation of the complex structure of documents has not received a great deal of previous attention, but we have found it to be an important factor in our work. We...

Publication date: 2004